Abstract - As more people browse the Web to gather information,
recognizing Web browsing behavior signatures can replace
or  complement  keystroke  authentication  where  authentication  is 
defined  as  the  capability  of  identifying  an  individual  within  a 
set  of  individuals.  We  claim  that  recurring  temporal  patterns  of 
Web  site  visits  can  help  identify  an  individual  of  interest  and, 
more  generally,  categorize  Web  browsing  behavior.  Furthermore, 
just  like  keystroke  authentication,  attribution  of  Web  behavior 
is  not  obtrusive  and  has  applications  in  cyberwarfare  as  a  new 
biometric technique. In this paper we describe some exploratory
work and preliminary comparative results of machine learning
techniques applicable to the problem of attributing Web browsing
behavior.

I.  Introduction 

Anonymity  is  difficult  to  preserve  and  might  not  be 
completely  possible.  Rather,  it  is  possible  to  hide  behind 
pseudonymity  and  unobservability  when  interacting  on  the 
Web. While pseudonymity is the capability to hide behind a
false identity, unobservability is the capability to hide in plain
sight [1]. Together, pseudonymity and unobservability make
a  powerful  recipe  for  quasi-anonymity.  Consequently,  there  is 
a  need  in  cyberspace  to  know  who  our  attackers  are  and  to 
track  them  wherever  they  might  be  on  the  network  -  a  problem 
known  as  attribution. 

Marketers  have  long  been  interested  in  understanding  Web 
interaction  behavior  [2],  [3],  [4]  in  order  to  design  Web  sites 
that  entice  visitors  to  finish  their  Web  session  with  a  checkout 
of  their  shopping  cart.  Behavioral  targeting  is  an  approach  used 
by  advertisers  (e.g.,  Doubleclick)  that  track  Web  behavior 
to  deliver  advertisements  matching  an  individual’s  profile. 
Research  in  this  area  has  concentrated  on  identifying  the 
demographic  characteristics  of  a  behavior  such  as  age  and 
gender  rather  than  authenticating  a  single  individual  [5].  There 
has  also  been  some  research  on  understanding  online  browsing 
behavior  from  an  aggregate  perspective  in  order  to  identify 
influential websites in user navigation patterns [6].

Section  II  reviews  the  related  work  in  the  area  of  attribution 
in cyberspace. Section III introduces our proposed methodology
in this area with some exploratory evaluation in Section
IV.  Finally,  we  conclude  in  Section  V  with  our  direction  for 
research. 

II.  Related  Work 

The attribution problem in cyberspace has been addressed
in several ways, mainly by leveraging features in the
browser (e.g., history stealing, cookies, etc.) or accessing
datasets containing partially identifying information. For
example, de-anonymization in social networking websites has
been accomplished by taking the intersection of users from
group memberships in a social network through information
from hyperlinks in the browser history and knowledge about
those groups [7]. In general, unique identification is possible
by cross-referencing independent information sets containing
partial information with a universal set in a manner equivalent
to a database join (also known as “linkage attacks”). For
example, it has been possible to link medical records to
individuals in voter registration records [8]. Some success has been
reported  with  the  classification  of  global  syntactic  features  of 
a  Web  session  (e.g.  length  of  session,  average  time  on  a  page, 
etc.)  per  user  [9]  but  this  mode  of  identification  is  easy  to 
defeat  and  only  serves  as  a  proof  of  concept  that  signature 
identification  is  possible  with  data  aggregated  over  several 
sessions. It has also been shown that authorship of content can
be determined from stylometric features on an Internet scale,
threatening anonymity [10], but this type of attribution depends
on published content. Research in predicting user behavior in
cyberspace has also been directed toward improving tasks such
as information retrieval [11] or desktop assistance [12]. For
example,  based  on  the  content  of  the  current  Web  page  and  a 
user’s  original  search  keywords,  the  most  relevant  hyperlinks 
in  the  page  are  highlighted  to  guide  selection  of  the  next 
page  to  visit.  This  type  of  prediction  is  oriented  toward  the 
information  presented  in  context  to  the  user  rather  than  the 
specific  activity  that  a  user  might  pursue  (e.g.  send  an  email, 
read  a  paper,  etc.). 

In contrast to previous approaches, we address the attribution
problem by leveraging both the syntactic patterns in
Web browsing history and the semantic content of this history.

III.  Proposed  Methodology 

Our approach to tracking and authenticating Web
browsing behavior involves the following tasks:

1)  Classify  Web  pages  into  genres  at  multiple  levels  of 
granularity; 

2)  Encode  temporal  sequences  of  individual  Web  browsing 
behavior  consisting  of  Web  page  genres  and  pauses  from 
clickstream  data; 

3)  Learn  Web  behavior  signatures  (profiles)  with  structured 
prediction; 

4)  Recognize  an  individual  or  typical  behavior  or  report 
unknown. 
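
As an illustration of task 2, the following minimal sketch shows one way a clickstream could be encoded as a sequence of genre and pause tokens. It is not the encoding used in the study: the pause thresholds, the token names, and the function names `pause_token` and `encode_session` are assumptions introduced here, and the genre labels are assumed to come from the classifier of task 1.

```python
from typing import List, Tuple

# Hypothetical pause buckets in seconds; the thresholds are illustrative only.
PAUSE_BUCKETS = [(5.0, "PAUSE_SHORT"), (60.0, "PAUSE_MEDIUM")]
LONG_PAUSE = "PAUSE_LONG"


def pause_token(seconds: float) -> str:
    """Map a pause duration to a coarse symbolic token."""
    for upper_bound, token in PAUSE_BUCKETS:
        if seconds < upper_bound:
            return token
    return LONG_PAUSE


def encode_session(clicks: List[Tuple[float, str]]) -> List[str]:
    """Interleave page genres with the pauses between consecutive clicks.

    `clicks` holds (timestamp_in_seconds, genre) pairs; the genre of each
    page is assumed to have been assigned beforehand (task 1).
    """
    ordered = sorted(clicks, key=lambda click: click[0])
    tokens: List[str] = []
    for k, (timestamp, genre) in enumerate(ordered):
        if k > 0:
            # Pause between the previous click and this one.
            tokens.append(pause_token(timestamp - ordered[k - 1][0]))
        tokens.append(genre)
    return tokens
```

A sequence encoded this way is a flat list of symbols, so it can serve as the observation sequence for the sequence models discussed in the following sections.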

In order to acquire the relevant clickstream data, we have
developed a Firefox browser extension to be used in an
ethnographic user study which has been approved by our
institutional review board. Other datasets will also be generated
from public social media. The overall dataflow is illustrated
in Fig. 1. We claim that genres, as functional categories of
information presentation, are more indicative of Web browsing
behavior than content alone. For example, some people take
longer than others to read a research paper. Identifying Web
browsing signatures is a search through abstraction spaces:
the varying degree of generality in the semantic content of the
Web pages and in the syntactic pattern of the behavior (Fig. 2)
necessary to uniquely identify someone. To borrow an analogy
from natural language processing, taking a set of genres as
our lexicon, a Web browsing history as a natural language
sentence and a profile as an encoded parse tree, the problem of
attribution reduces to the task of identifying the parse tree
best suited to a given sentence.

Figure 1. Web behavior attribution data flow

Figure 2. Search through abstraction spaces for Web signatures. Each
signature is a dot in the problem space


A.  Structured  Prediction 

Structured prediction is a supervised learning method that
addresses problems where the output itself is complex [13].
For example, the output can be in the form of a tree, a sequence
or a graph. Structured prediction is being actively researched
in natural language processing to predict parse trees from
sentences and in sequence matching problems such as machine
translation [14]. For example, when translating a sentence from
English to French, a word for word translation is not enough
because it ignores correlations and constraints among words.
What has to be predicted here is a set of mappings and not just
individual mappings. Because of this selective matching
between two sets, the input and output set, structured prediction
methods must solve a combinatorial optimization problem.
Traditional classification methods make local predictions but
(1) what has to be predicted is different from the sum of
the parts and (2) the constraints and correlations between the
output features themselves can help improve the prediction
of the parts. Formally, structured prediction involves learning
a mapping from complex inputs $x \in X$ to complex outputs
$y \in Y$ from a training sample of input-output pairs $(x_i, y_i)$
drawn from an unknown distribution. In our problem, there are
dependencies between Web page requests that extend beyond the
last page visited due to the non-linearity (hypertext) of Web
content or inadvertent page clicks.

Hidden Markov Models (HMMs) [15] have long been the
traditional method to model behavior from observations but
they are limited in their capability to represent constraints
between any two states of the output because of their Markov
assumption and the independence assumption of the
observations. In addition, HMMs become intractable when the
observations are not enumerable. Although work has been done
on overcoming those limitations resulting in complex models,
structured prediction methods provide a unified framework to
predict and learn arbitrary activity patterns. In addition, HMMs
require enough training data to obtain an accurate generative
model of observations while structured prediction methods
leverage only dependencies in the observation sequence
and local predictions and therefore might get away with
much less data. Structured prediction methods, like maximum
entropy Markov models, turn HMMs on their head by
conditioning the probability of $Y$ on the observations $X$, making
it possible to leverage modern discriminative classifiers.
Structured prediction methods make it possible to solve the
HMMs’ decoding problem (i.e., finding the output sequence of
hidden states from the input sequence of observations) without
a complete model capable of generating the observations (i.e.,
HMMs’ learning problem).

B.  Approaches  to  Structured  Prediction 

There  are  several  approaches  to  solve  the  combinatorial 
problem  of  structured  prediction.  They  all  involve  a  change 
of  representation  that  includes  tentative  predictions  of  the 
output  sequence.  We  review  below  the  probabilistic  approach 
of  conditional  random  fields  (CRFs)  [16]  and  a  reinforcement 
learning approach [17].

1) CRFs: CRFs are predicated on the interaction of neighboring
labels in a sequence. The Ising model, formalized by
the theory of random fields, describes how global
properties emerge from local interactions. Similarly, CRFs
exploit  those  local  interactions  with  overlapping  features  from 
the  observations  X  and  the  labels  Y,  to  predict  the  most  likely 
label  sequence  conditioned  on  the  observations.  The  features 
express  correlations  and  dependencies  between  the  observation 
sequence  and  the  label  sequence,  among  the  observations 
themselves, or among the adjacent labels themselves. A feature
$F_j(x, y)$ is the sum of binary features $f_j$ describing an example
$x \in X$ of length $n$:

$$F_j(x, y) = \sum_{i=1}^{n} f_j(y_{i-1}, y_i, x, i) \qquad (1)$$

For example, in describing Web page sequences, such a
feature $f_j$ for Web page $x$ at position $i$ could be:

$$f_j(y_{i-1}, y_i, x, i) =
\begin{cases}
1 & \text{if title contains \emph{race} and genre = sports news and previous genre = search page} \\
0 & \text{otherwise}
\end{cases} \qquad (2)$$

CRFs overcome the problem of dynamic length sequences
by having a fixed set of overlapping features. A discriminative
classifier, such as logistic regression, can then be used to learn
the weights $w_j$ of those features in a training phase to obtain
real-valued features $g_i$:

$$g_i(y_{i-1}, y_i) = \sum_{j=1}^{J} w_j f_j(y_{i-1}, y_i, x, i) \qquad (3)$$

The score $U(n, y_n)$ of the entire label sequence $y$ of
length $n$ ending with label $y_n$ can then be computed with
the following recurrence relation [18], evaluated by dynamic
programming algorithms such as the Viterbi algorithm (Alg.
1):

$$U(n, y_n) = \max_{y_{n-1}} \left[ U(n-1, y_{n-1}) + g_n(y_{n-1}, y_n) \right] \qquad (4)$$

In our specific problem, genre classifications can still be
too ambiguous to uniquely associate a Web page genre with an
individual. For example, is the feature that a Web page
visited is a blog or a political blog important in identifying
an individual? CRFs will therefore disambiguate between
genre classifications by learning their feature weights while
searching for the best sequence of activities to construct an
individual profile according to Eq. 4. Figure 3 illustrates the
role of features in CRFs.

Figure 3. CRF feature linking temporal states $y_{i-1}$ and $y_i$ with document
features $x$.

Algorithm 1 Iterative Viterbi Algorithm for HMMs

name:   viterbi
input:  M,       % observations
        sprobs,  % state S probabilities
        trans,   % transition probabilities
        eprobs   % emission probabilities
output: path,    % most likely state sequence
        prob     % path probability

t ← 0
foreach s ∈ S
    a[s] ← sprobs(s) * eprobs(m_0, s)
path[t] ← argmax_s(a)
t ← t + 1
M ← M \ {m_0}
foreach m ∈ M
    foreach s ∈ S
        maxval ← 0
        foreach s' ∈ S
            temp ← a[s'] * trans(s', s)
            maxval ← max(temp, maxval)
        a'[s] ← maxval * eprobs(m, s)
    path[t] ← argmax_s(a')
    a ← a'
    t ← t + 1
return path, Σ_s a[s]
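
The following minimal sketch puts Eqs. 1 to 4 together: binary features, a weighted sum per position, and a max-sum dynamic program with back-pointers. It is an illustration under stated assumptions rather than our implementation: the weights are set by hand instead of being learned, and the `START` label, the second feature `f_search_first`, and all function names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

START = "<start>"  # hypothetical label standing in for y_0

Page = Dict[str, str]
# A binary feature f_j(y_prev, y_cur, x, i) returning 0 or 1 (Eqs. 1 and 2).
Feature = Callable[[str, str, Sequence[Page], int], int]


def f_race(y_prev: str, y_cur: str, x: Sequence[Page], i: int) -> int:
    """The example feature of Eq. 2."""
    return int("race" in x[i]["title"].lower()
               and y_cur == "sports news" and y_prev == "search page")


def f_search_first(y_prev: str, y_cur: str, x: Sequence[Page], i: int) -> int:
    """Hypothetical feature: the session opens on a search page."""
    return int(y_prev == START and y_cur == "search page")


def g(i: int, y_prev: str, y_cur: str, x: Sequence[Page],
      features: Sequence[Feature], weights: Sequence[float]) -> float:
    """Eq. 3: weighted sum of the binary features at position i."""
    return sum(w * f(y_prev, y_cur, x, i) for f, w in zip(features, weights))


def best_sequence(x: Sequence[Page], labels: Sequence[str],
                  features: Sequence[Feature],
                  weights: Sequence[float]) -> Tuple[List[str], float]:
    """Eq. 4: best-scoring label sequence by dynamic programming."""
    n = len(x)
    if n == 0:
        return [], 0.0
    # U[i][y] is the best score of a label prefix ending with y at position i.
    U = [{y: g(0, START, y, x, features, weights) for y in labels}]
    back: List[Dict[str, str]] = [{}]
    for i in range(1, n):
        U.append({})
        back.append({})
        for y in labels:
            prev = max(labels, key=lambda p: U[i - 1][p]
                       + g(i, p, y, x, features, weights))
            U[i][y] = U[i - 1][prev] + g(i, prev, y, x, features, weights)
            back[i][y] = prev
    # Follow the back-pointers from the best final label.
    last = max(labels, key=lambda y: U[n - 1][y])
    path = [last]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path, U[n - 1][last]
```

Unlike Algorithm 1, which records the locally best state at each step, this sketch keeps back-pointers so that the returned sequence is the one that attains the maximum score of Eq. 4.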

2) Reinforcement Learning: The identification of a most
likely sequence of states was shown to be equivalent to finding
an optimal policy for a Markov decision process given a
set of states and actions by considering the maximization of
expected reward as the minimization of empirical loss on the
output training sequence [17]. The actions here are the possible
predictions $y_i$ from state $\{x_i, y_{i-1}\}$. Bellman’s optimality
equation (Eq. 5) is a recurrence relation for sequential decision
tasks where $V^*(s)$ is the optimal value of state $s$, $a$ is the
action executed in $s$ that leads to $s'$ such that $\sum_{s'} P^a_{ss'} = 1$
and $P^a_{ss'} = P(s' \mid s, a)$, $a \in A$, the set of actions, $r^a_{ss'}$ the current
reward, and $\gamma$ the discount factor, $0 < \gamma < 1$, weighting future
rewards.

$$V^*(s) = \max_a \sum_{s'} P^a_{ss'} \left[ r^a_{ss'} + \gamma V^*(s') \right] \qquad (5)$$

The transition probabilities $P^a_{ss'}$ can be obtained while
training using model-based reinforcement learning [19]. After
training, the prediction and evaluation of the most likely
sequence for each user can be made by following the optimal
policy mapping states to actions learned for this user with Eq.
4 where $g_n(y_{n-1}, y_n) = V^*(s_{n-1})$. In this framework, the
stepwise reward has to be proportional to the prediction loss.
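
To make the Bellman optimality equation concrete, the sketch below applies value iteration, one standard way of solving it, to a small model. The model representation, the function name `value_iteration`, and the default parameters are assumptions made for illustration; the sketch takes the transition probabilities and rewards as given instead of estimating them during training.

```python
from typing import Dict, List, Tuple

# P[s][a] lists (next_state, probability, reward) triples for action a in s.
Model = Dict[str, Dict[str, List[Tuple[str, float, float]]]]


def backup(outcomes: List[Tuple[str, float, float]],
           V: Dict[str, float], gamma: float) -> float:
    """Expected reward plus discounted value of the successor states."""
    return sum(p * (r + gamma * V[s2]) for s2, p, r in outcomes)


def value_iteration(P: Model, gamma: float = 0.9,
                    tol: float = 1e-9) -> Tuple[Dict[str, float], Dict[str, str]]:
    """Repeat the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s, actions in P.items():
            if not actions:  # terminal state: its value stays at 0
                continue
            best = max(backup(outcomes, V, gamma)
                       for outcomes in actions.values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # The greedy policy picks, in each state, the action with the best backup.
    policy = {s: max(actions, key=lambda a: backup(actions[a], V, gamma))
              for s, actions in P.items() if actions}
    return V, policy
```

In the attribution setting, each action would correspond to predicting the next label, so the greedy policy returned here plays the role of the per-user predictor described above.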

Inverse  reinforcement  learning  (IRL)  [20]  addresses  the 
problem  of  learning  the  reward  function  corresponding  to  a  set 
of  trajectories  in  the  training  set  such  that  an  optimal  policy 
can  be  found  that  will  approximate  those  trajectories.  IRL 
methods can then be applied to structured prediction problems
by representing the search through the output space in the
prediction problem as a sequential decision making problem
[21], [13]. By learning a specific reward function, the optimal
policy encapsulates a model of user behavior.

IV.  Proof  of  Concept  Evaluation 

Before attempting structured induction algorithms, we
evaluated several learning algorithms for the Web browsing
identification task with the discovery challenge dataset from ECML
[22], [23] as proof of concept. In this dataset, a user session
is characterized by its timestamp, a sequence of pages visited,
categorized by page type, and number of page views (page
loads) (Table I). For example, the sequence of page types visited
given as “12,1 9,3 7,2” corresponds to the chronological
sequence “12,9,9,9,7,7”. User sessions were extracted randomly
from the training and test sets and the experiments consisted
of distinguishing them correctly in the test data based on
profiles built from the training data. The training data averaged
89 user sessions ranging from 1 to 183 page types visited. Four
basic types of algorithms to build profiles were compared
with random selection based on the class distribution during
training as a baseline:

1) Discrete Markov process algorithm based on profiles
built from the transition probabilities between page types
for each user during training. Each page type observation
$O_t$ corresponds to state $S_t$. The profile with the most
likely sequence of transition probabilities between page
types was selected for identification.

2) HMM maximum likelihood training with the Baum-Welch
algorithm implemented using Jahmm [24]. Each
user profile was built with three fully interconnected
hidden states with initial uniform probabilities to model
the continuum of a session (beginning, middle, and
end). Unlike discrete Markov processes, the page type
observation $O_t$ is decoupled from the state $S_t$ in an
HMM. The profile with the most likely sequence of
hidden states $S$ using the iterative Viterbi algorithm
(Alg. 1) was then selected for identification.

3) Classification of users from the frequencies of page
types visited with the decision tree algorithm J48 from
the Weka machine learning toolbench [25].

4) Classification of users from global syntactic features of
the session (number of pages in the session, average
number of page views, session length, day of the week
and time of day) with the decision tree algorithm J48.
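
The first algorithm in the list above can be sketched as follows, together with the expansion of the run-length session format. This is an illustration, not the code used in the experiments: the add-alpha smoothing is an assumption introduced here to avoid zero probabilities for unseen transitions, and the function names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence

Profile = Dict[int, Dict[int, float]]


def decode_session(encoded: str) -> List[int]:
    """Expand "type,views" pairs, e.g. "12,1 9,3 7,2" -> [12, 9, 9, 9, 7, 7]."""
    pages: List[int] = []
    for pair in encoded.split():
        page_type, views = pair.split(",")
        pages.extend([int(page_type)] * int(views))
    return pages


def train_profile(sessions: Sequence[Sequence[int]],
                  page_types: Sequence[int], alpha: float = 1.0) -> Profile:
    """Estimate P(next page type | current page type) for one user.

    Add-alpha smoothing is an assumption of this sketch.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for session in sessions:
        for cur, nxt in zip(session, session[1:]):
            counts[cur][nxt] += 1.0
    profile: Profile = {}
    for cur in page_types:
        total = sum(counts[cur].values()) + alpha * len(page_types)
        profile[cur] = {nxt: (counts[cur][nxt] + alpha) / total
                        for nxt in page_types}
    return profile


def log_likelihood(session: Sequence[int], profile: Profile) -> float:
    """Log-probability of the session's transitions under one profile."""
    return sum(math.log(profile[cur][nxt])
               for cur, nxt in zip(session, session[1:]))


def identify(session: Sequence[int],
             profiles: Dict[Hashable, Profile]) -> Hashable:
    """Select the user whose profile makes the session most likely."""
    return max(profiles, key=lambda user: log_likelihood(session, profiles[user]))
```

Working in log space avoids numerical underflow on long sessions, which matters here because sessions range up to 183 page types visited.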

Figure 4 illustrates the comparative results obtained using
the weighted F-measure in discriminating between 2 and 10 users
for each individual Web session. This experiment shows that
machine learning techniques are promising in this domain,
achieving significant performance results for the first three
algorithms, which take into account state transitions,
state/observation probabilities, and page types, while the
performance of purely syntactic patterns degrades rapidly.
Future research will consist of scaling up those results with
structured prediction and reinforcement learning and additional
information from webpages and clickstream data.
Figure 4. Comparative Results for discrimination between 2-10 users using
weighted F-measure
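
For reference, the sketch below computes a weighted F-measure, assuming the common definition in which the per-user F1 scores are averaged with weights proportional to the number of true sessions of each user. The function name and the input format are assumptions of this sketch.

```python
from collections import Counter
from typing import Hashable, Sequence


def weighted_f_measure(y_true: Sequence[Hashable],
                       y_pred: Sequence[Hashable]) -> float:
    """Average of per-class F1 scores, weighted by true class frequency."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    support = Counter(y_true)      # number of true sessions per user
    predicted = Counter(y_pred)    # number of sessions attributed to each user
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)
    score = 0.0
    for user, n_true in support.items():
        tp = correct[user]
        precision = tp / predicted[user] if predicted[user] else 0.0
        recall = tp / n_true
        if precision + recall > 0.0:
            f1 = 2.0 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        score += (n_true / total) * f1
    return score
```

Weighting by class frequency keeps a user with few sessions from dominating the score, which is relevant when the number of sessions per user varies.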

V.  Conclusion 

We  claim  that  the  genre  of  Web  sites  visited  can  tell  us 
something  about  an  individual  browser  and  that  how  and  when 
the  browsing  was  done  is  also  revealing.  For  example,  the 
time  of  day  and  the  length  of  the  pause  at  a  page  can  give 
some  information.  We  have  shown  through  simple  experiments 
that  identification  of  users,  or  distinguishing  between  users, 
is  possible  within  a  certain  accuracy  but  that  performance 
degrades rapidly as the number of users increases. We claim
that structured prediction will allow us to scale up by
leveraging additional information in constructing Web behavior
signatures.  Web  browsing  has  become  another  dimension 
of  human  activity  and  this  methodology  could  be  used  for 
continuous  or  periodic  identification  to  complement  an  initial 
strong  identification  technique  for  authentication. 